Executive Summary:

The following markdown file follows the exploratory data analysis performed on the employee information data set provided, titled CaseStudy2-data.csv. The purpose of this EDA is to predict Attrition and Monthly income based on the variables available within the data set. Any questions can be addressed to Chad Kwong at .

Introduction:

The markdown file starts with the initial loading/cleaning of the data and moves on to each aspect of the EDA. Each chunk of code is commented and a summary outline is provided before and after each chunk as well. Any questions can be addressed to Chad Kwong at .

Loading and cleaning up the data

# Initial Loading of data
AttritionData = read.csv("/Users/chadkwong/Desktop/SMU/GitHub/Case-Study-2/CaseStudy2-data.csv")
str(AttritionData)
## 'data.frame':    870 obs. of  36 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ Attrition               : chr  "No" "No" "No" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
##  $ DailyRate               : int  117 1308 200 801 567 294 1283 309 1333 653 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Sales" ...
##  $ DistanceFromHome        : int  13 14 18 1 2 10 5 10 10 10 ...
##  $ Education               : int  4 3 2 4 1 2 5 4 4 4 ...
##  $ EducationField          : chr  "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
##  $ EnvironmentSatisfaction : int  2 3 3 3 1 4 2 4 3 4 ...
##  $ Gender                  : chr  "Male" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  73 44 60 48 32 32 90 88 87 92 ...
##  $ JobInvolvement          : int  3 2 3 3 3 3 4 2 3 2 ...
##  $ JobLevel                : int  2 5 3 3 1 3 1 2 1 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
##  $ JobSatisfaction         : int  4 3 4 4 4 1 3 4 3 3 ...
##  $ MaritalStatus           : chr  "Divorced" "Single" "Single" "Married" ...
##  $ MonthlyIncome           : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ MonthlyRate             : int  9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
##  $ NumCompaniesWorked      : int  2 1 2 1 1 1 2 2 1 1 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "No" "No" "No" "No" ...
##  $ PercentSalaryHike       : int  11 14 11 19 13 21 12 14 19 14 ...
##  $ PerformanceRating       : int  3 3 3 3 3 4 3 3 3 3 ...
##  $ RelationshipSatisfaction: int  3 1 3 3 3 3 1 3 4 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  1 0 0 2 0 2 0 3 1 1 ...
##  $ TotalWorkingYears       : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ TrainingTimesLastYear   : int  3 2 2 3 2 4 5 5 2 3 ...
##  $ WorkLifeBalance         : int  2 4 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole      : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ YearsSinceLastPromotion : int  0 4 2 5 1 1 0 0 0 7 ...
##  $ YearsWithCurrManager    : int  3 9 2 7 3 7 3 0 0 7 ...
# Cleaning and Altering data for initial investigation
AttritionData$Sorted_MonthlyIncome = cut(AttritionData$MonthlyIncome, breaks = c(0,4000,8000,12000,16000,20000), labels = c("Under 4k", "4k-8k", "8k-12k", "12k-16k", "16k+"))
AttritionData$Sorted_YearsAtCompany = cut(AttritionData$YearsAtCompany, breaks = c(-1,4,8,12,16,max(AttritionData$YearsAtCompany)), labels = c("Under 4","4-8","8-12","12-16","16+"))
AttritionData$Attrition = factor(AttritionData$Attrition)
AttritionData$JobRole = factor(AttritionData$JobRole)
AttritionData$TotalWorkingYears = factor(AttritionData$TotalWorkingYears)
AttritionData$YearsAtCompany = factor(AttritionData$YearsAtCompany)
AttritionData$YearsWithCurrManager = factor(AttritionData$YearsWithCurrManager)
yes = which(AttritionData$Attrition == "Yes")
AttritionYes = AttritionData[yes,]
table(is.na(AttritionData))
## 
## FALSE 
## 33060

No missing values were found in the data set.

Graphs included in initial investigation

The following code outlines my thought process around my initial investigation into predicting attrition based on 3 variables. This investigation was performed using empirical interpretation without supporting evidence.

# Box plot of Stock Option Level and Monthly Income based on all Yes responses to Attrition
StockIncomeBox = ggplot(data = AttritionYes, aes(x = StockOptionLevel, y = MonthlyIncome)) + geom_boxplot()

# plots exploring relationship between Stock Option Level and Monthly Income
StockHist = ggplot(AttritionData, aes(x = StockOptionLevel, fill = Sorted_MonthlyIncome)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
StockHist

StockIncomePoint = ggplot(data = AttritionData, aes(x = StockOptionLevel, y = MonthlyIncome, color = Sorted_MonthlyIncome)) + geom_point() + facet_wrap(~Attrition)

# exploring distribution of monthly income and relationship to job level
IncomeHist = ggplot(AttritionData, aes(x = Sorted_MonthlyIncome, fill = Sorted_MonthlyIncome)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
IncomeHist

SortedIncomePoint = ggplot(AttritionData, aes(x = JobLevel, y = MonthlyIncome, color = Sorted_MonthlyIncome)) + geom_point()
SortedIncomeBox = ggplot(AttritionData, aes(x = MonthlyIncome, y = JobLevel, color = Sorted_MonthlyIncome)) + geom_boxplot()

# Histogram of years at company divided by Attrition
YearHist = ggplot(AttritionData, aes(x = Sorted_YearsAtCompany, fill = Sorted_YearsAtCompany)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
YearHist

# Histogram of Education divided by Attrition
EducationHist = ggplot(AttritionData, aes(x = AttritionData$Education)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)

# Histogram of Job Satisfaction divided by Attrition
JobSatisfHist = ggplot(AttritionData, aes(x = AttritionData$JobSatisfaction)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)

# Histogram of Job Roles
JobRoleHist = ggplot(AttritionData, aes(x = AttritionData$JobRole)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)

# Histogram of Environment Satisfaction
environmentsat = ggplot(AttritionData, aes(x = AttritionData$EnvironmentSatisfaction)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)

With all the plots above in mind, I chose to further investigate Monthly Income, Stock Option level, and Years at Company.

Verifying investigation

I chose to attempt to use Learning Vector Quantization algorithm to compute the importance of variables within the attrition data set and use a repeated cross validation for verifying the model.

# Using LVQ to predict variables and then performing a cross validation. Importance of the variables are then computed and displayed on a graph

#creating and training the model
  model <- train(Attrition~.,  data=AttritionData[,c(2,3,5,7,8,11,12,14,15,16,18,20,21,22,25,26,27,29,30,31,32,33,34,35,36)], method="lvq", preProcess="scale", trControl=trainControl(method="cv"))

# estimating variable importance using the model
  importance <- varImp(model, scale=FALSE)
  
# summarizing the importance
  print(importance)
## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 24)
## 
##                          Importance
## MonthlyIncome                0.6567
## TotalWorkingYears            0.6563
## YearsAtCompany               0.6470
## StockOptionLevel             0.6455
## JobLevel                     0.6406
## YearsInCurrentRole           0.6403
## YearsWithCurrManager         0.6291
## Age                          0.6265
## JobInvolvement               0.6159
## JobSatisfaction              0.5833
## DistanceFromHome             0.5586
## EnvironmentSatisfaction      0.5532
## WorkLifeBalance              0.5491
## TrainingTimesLastYear        0.5428
## Education                    0.5384
## NumCompaniesWorked           0.5354
## MonthlyRate                  0.5335
## HourlyRate                   0.5282
## DailyRate                    0.5277
## RelationshipSatisfaction     0.5268
# plotting the importance
  plot(importance, main = "Ranking of Variables")

The graph above predicts that the top 3 contributing variables are Monthly Income, Total Working Years, and Years at Company when predicting Attrition.

Creating a Naive Bayes Model

I chose to create a Naive Bayes model using the Monthly Income, Total Working Years, and Years at company. I then wanted to create a model using the best possible seed, so I calculated the best seed based off of the first 50 seeds. I then choose to compare the model using the predicted variables to the model created using my own chosen variables and then build the superior model with the optimum seed value.

## Variables to use: Monthly Income(c20)
## Considerable Variables: Stock Option Level(c29) Years at Company(c33), Total Working Years(c30)

# Testing Accuracy of top 3 predicted variables: Monthly Income, Total Working Years, Years at Company
nums = data.frame(accuracy = numeric(50), seed = numeric(50))
for(i in 1:50)
{
  set.seed(i)
  Indices = sample(seq(1:length(AttritionData$Attrition)),round(.70*length(AttritionData$Attrition)))
  train = AttritionData[Indices,]
  test = AttritionData[-Indices,]
  train = upsample(train, cat_col = "Attrition")
  train = upsample(train, cat_col = "Attrition")
  model = naiveBayes(Attrition ~ MonthlyIncome + TotalWorkingYears + YearsAtCompany, data = train[,c(3,20,33,30)], laplace = 10)
  predictions = predict(model, test)
  CM = confusionMatrix(predictions,test$Attrition)
  nums$accuracy[i] = CM$overall[1]
  nums$seed[i] = i
  nums$Sensitivity[i] = CM$byClass[1]
  nums$Specificity[i] = CM$byClass[2]
}
# Displaying row with highest accuracy
nums[which(nums$accuracy == max(nums$accuracy)),]
##     accuracy seed Sensitivity Specificity
## 20 0.7509579   20   0.7733333   0.6111111
# Displaying Confusion Matrix Statistics for all seeds 1-50
legendcolors <- c("Accuracy" = "#AB0E14", "Sensitivity" = "#F1B11B", "Specificity" = "#957531")
ggplot(nums) + geom_line(aes(x = seed, y = accuracy, color = "Accuracy")) + geom_line(aes(x = seed, y = Sensitivity, color = "Sensitivity")) + geom_line(aes(x = seed, y = Specificity, color = "Specificity")) + labs(x = "Seed", y ="Score", color = "Legend") + scale_color_manual(values = legendcolors) + ggtitle("Confusion Matrix Statistics for 50 Seeds")

# Testing Accuracy of variables chosen from initial investigation: Monthly Income, Stock Option Level, Years at Company
accs = data.frame(accuracy = numeric(50), seed = numeric(50))
for(i in 1:50)
{
  set.seed(i)
  Indices = sample(seq(1:length(AttritionData$Attrition)),round(.70*length(AttritionData$Attrition)))
  train = AttritionData[Indices,]
  test = AttritionData[-Indices,]
  train = upsample(train, cat_col = "Attrition")
  model = naiveBayes(Attrition ~ ., data = train[,c(3,20,33,29)])
  predictions = predict(model, test)
  CM = confusionMatrix(predictions,test$Attrition)
  accs$accuracy[i] = CM$overall[1]
  accs$seed[i] = i
  accs$Sensitivity[i] = CM$byClass[1]
  accs$Specificity[i] = CM$byClass[2]
}
# Displaying highest accuracy for comparison
accs[which(accs$accuracy == max(accs$accuracy)),]
##    accuracy seed Sensitivity Specificity
## 20 0.697318   20   0.7200000   0.5555556
## 41 0.697318   41   0.7196262   0.5957447
## Model Using Monthly Income, Total Working Years, and Years at Company with seed 20
set.seed(20)
#creating training and test sets
Indices = sample(seq(1:length(AttritionData$Attrition)),round(.70*length(AttritionData$Attrition)))
train = AttritionData[Indices,]
test = AttritionData[-Indices,]
#upsampling Attrition column for model
train = upsample(train, cat_col = "Attrition")
train = upsample(train, cat_col = "Attrition")
model = naiveBayes(Attrition ~ ., data = train[,c(3,20,33,30)])
predictions = predict(model, test)
confusionMatrix(predictions,test$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  175  14
##        Yes  50  22
##                                          
##                Accuracy : 0.7548         
##                  95% CI : (0.698, 0.8057)
##     No Information Rate : 0.8621         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.2739         
##                                          
##  Mcnemar's Test P-Value : 1.214e-05      
##                                          
##             Sensitivity : 0.7778         
##             Specificity : 0.6111         
##          Pos Pred Value : 0.9259         
##          Neg Pred Value : 0.3056         
##              Prevalence : 0.8621         
##          Detection Rate : 0.6705         
##    Detection Prevalence : 0.7241         
##       Balanced Accuracy : 0.6944         
##                                          
##        'Positive' Class : No             
## 
# Writing attrition predictions to a new CSV file with the test data set
test$PredictedAttrition = predictions
AttritionPredictions = test[,c(1,3,20,30,33,39)]
write.csv(AttritionPredictions,"/Users/chadkwong/Desktop/SMU/GitHub/Case-Study-2/Attrition Predictions.csv", row.names = FALSE)

The above optimized Naive Bayes model provides an accuracy of ~75%, sensitivity of ~.7778, and specificity of ~.6111. With a small P-value of 1.2e-05, there is enough evidence to support the claim that Monthly Income, Years at Company, and Total Working Years are considerable variables when predicting Attrition.

Predicting Monthly Income Using Variables from Model

When predicting monthly income, I chose to use the variables used in my previous model NB model. Using a linear regression model, I predicted the relationship between Monthly Income, Years at Company, and Total Working Years. I then perform a Leave One Out Cross Validation on the fit used and display the relationship between all 3 variables within a 3d scatter plot.

# Regression Model for Predicting monthly income
fit = lm(MonthlyIncome ~ TotalWorkingYears + YearsAtCompany, data = AttritionData)
summary(fit)
## 
## Call:
## lm(formula = MonthlyIncome ~ TotalWorkingYears + YearsAtCompany, 
##     data = AttritionData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8819.4 -1414.2  -150.9  1162.4 10056.2 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1653.000    927.131   1.783 0.074978 .  
## TotalWorkingYears1    659.540   1178.258   0.560 0.575801    
## TotalWorkingYears2    971.921   1236.126   0.786 0.431945    
## TotalWorkingYears3   1056.231   1168.557   0.904 0.366333    
## TotalWorkingYears4   2309.748   1154.580   2.001 0.045782 *  
## TotalWorkingYears5   2141.839   1128.046   1.899 0.057961 .  
## TotalWorkingYears6   2652.052   1114.974   2.379 0.017613 *  
## TotalWorkingYears7   2396.310   1126.791   2.127 0.033753 *  
## TotalWorkingYears8   2418.744   1107.618   2.184 0.029271 *  
## TotalWorkingYears9   4780.867   1132.237   4.222 2.69e-05 ***
## TotalWorkingYears10  4432.998   1107.570   4.002 6.85e-05 ***
## TotalWorkingYears11  4556.978   1226.265   3.716 0.000216 ***
## TotalWorkingYears12  4700.550   1197.884   3.924 9.46e-05 ***
## TotalWorkingYears13  4728.396   1184.442   3.992 7.15e-05 ***
## TotalWorkingYears14  4758.494   1215.498   3.915 9.82e-05 ***
## TotalWorkingYears15  6337.903   1179.512   5.373 1.01e-07 ***
## TotalWorkingYears16  7563.668   1207.858   6.262 6.19e-10 ***
## TotalWorkingYears17  5030.989   1236.381   4.069 5.19e-05 ***
## TotalWorkingYears18  6339.018   1285.335   4.932 9.91e-07 ***
## TotalWorkingYears19  4591.233   1293.715   3.549 0.000409 ***
## TotalWorkingYears20  5763.967   1273.193   4.527 6.89e-06 ***
## TotalWorkingYears21 15677.783   1313.646  11.935  < 2e-16 ***
## TotalWorkingYears22 14399.083   1264.673  11.386  < 2e-16 ***
## TotalWorkingYears23 12739.309   1301.595   9.787  < 2e-16 ***
## TotalWorkingYears24 13856.048   1434.186   9.661  < 2e-16 ***
## TotalWorkingYears25 13421.651   1607.448   8.350 2.99e-16 ***
## TotalWorkingYears26 17108.901   1431.587  11.951  < 2e-16 ***
## TotalWorkingYears27 17998.967   1866.429   9.644  < 2e-16 ***
## TotalWorkingYears28 13427.767   1500.291   8.950  < 2e-16 ***
## TotalWorkingYears29 14815.141   1721.993   8.603  < 2e-16 ***
## TotalWorkingYears30 12357.986   1573.939   7.852 1.32e-14 ***
## TotalWorkingYears31 11987.898   1724.372   6.952 7.47e-12 ***
## TotalWorkingYears32 13962.825   1722.533   8.106 1.95e-15 ***
## TotalWorkingYears33 14244.903   2340.343   6.087 1.79e-09 ***
## TotalWorkingYears34 17576.915   2698.136   6.514 1.29e-10 ***
## TotalWorkingYears35 16645.812   2060.670   8.078 2.42e-15 ***
## TotalWorkingYears36 17157.989   1683.709  10.191  < 2e-16 ***
## TotalWorkingYears37 13940.874   2916.218   4.780 2.08e-06 ***
## TotalWorkingYears40  8659.000   2622.322   3.302 0.001002 ** 
## YearsAtCompany1      -191.915    635.111  -0.302 0.762596    
## YearsAtCompany2        -2.334    625.938  -0.004 0.997026    
## YearsAtCompany3       -68.030    606.624  -0.112 0.910737    
## YearsAtCompany4      -755.249    642.173  -1.176 0.239911    
## YearsAtCompany5      -175.982    598.075  -0.294 0.768645    
## YearsAtCompany6       403.990    667.890   0.605 0.545433    
## YearsAtCompany7      -811.513    669.397  -1.212 0.225753    
## YearsAtCompany8       478.493    654.639   0.731 0.465037    
## YearsAtCompany9       -18.191    667.770  -0.027 0.978274    
## YearsAtCompany10     -345.434    651.363  -0.530 0.596034    
## YearsAtCompany11      -99.903    856.933  -0.117 0.907220    
## YearsAtCompany12    -1369.154   1024.069  -1.337 0.181611    
## YearsAtCompany13      682.476    803.520   0.849 0.395936    
## YearsAtCompany14     1132.948    912.088   1.242 0.214545    
## YearsAtCompany15    -2150.742    998.578  -2.154 0.031553 *  
## YearsAtCompany16      697.126   1275.835   0.546 0.584938    
## YearsAtCompany17    -1528.814   1291.647  -1.184 0.236916    
## YearsAtCompany18     -138.046   1086.428  -0.127 0.898921    
## YearsAtCompany19     -324.288   1139.371  -0.285 0.776009    
## YearsAtCompany20    -2790.696    971.153  -2.874 0.004166 ** 
## YearsAtCompany21     -670.088   1266.528  -0.529 0.596901    
## YearsAtCompany22      719.830   1013.225   0.710 0.477641    
## YearsAtCompany23    -2138.651   2782.323  -0.769 0.442324    
## YearsAtCompany24    -3845.496   1503.802  -2.557 0.010736 *  
## YearsAtCompany25     1835.225   2054.584   0.893 0.372000    
## YearsAtCompany26    -4386.670   1726.836  -2.540 0.011264 *  
## YearsAtCompany30    -4370.825   2850.360  -1.533 0.125565    
## YearsAtCompany31     5072.102   2263.269   2.241 0.025296 *  
## YearsAtCompany32     1775.483   1985.653   0.894 0.371507    
## YearsAtCompany33    -2441.915   3070.646  -0.795 0.426707    
## YearsAtCompany40           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2453 on 801 degrees of freedom
## Multiple R-squared:  0.7376, Adjusted R-squared:  0.7154 
## F-statistic: 33.12 on 68 and 801 DF,  p-value: < 2.2e-16
preds = predict(fit)

# Performing LOO Cross Validation for both models
train(MonthlyIncome ~ TotalWorkingYears + YearsAtCompany, data = AttritionData, method = "lm", trControl = trainControl(method = "LOOCV"))
## Linear Regression 
## 
## 870 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 869, 869, 869, 869, 869, 869, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2742.508  0.6473053  2004.609
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
surface_plot <- plot_ly(AttritionData, 
                     x = ~TotalWorkingYears,
                     y = ~YearsAtCompany, 
                     z = ~MonthlyIncome,
                     name = "Attrition Data", 
                     type = "scatter3d",
                     mode = "markers",
                     size = .75)

# Adding Title to Graph and proper labels for axis
surface_plot <- surface_plot %>% layout(title = "Relationship Between Monthly Income, Years At Company, and Total Working Years", xaxis = list(title = 'Total Working Years',zeroline = TRUE), yaxis = list(title = 'Years at  Company'), zaxis = list(title = 'Monthly Income'))

# Adding in predicted Monthly Income
surface_plot <-surface_plot %>% add_trace(z = ~preds, name = "Predicted Monthly Income", mode = "markers")

# Displaying Plot
surface_plot

Using the model, we can observe a RMSE of ~2743 which is within the threshold of 3000. A p-value of 2.2e-16 is observed as well. From the plot above, we can also see that the predicted monthly income falls within the distribution of the actual data set. I can thus conclude that the fit supports the relationship between Monthly Income, Years at Company, and Total Working Years. However, I speculate that there is not enough information to confirm that Monthly income is solely dependent on Years At Company and Total Years Working